Practical Statistics

Author: Yang Long
Email: longyang_123@yeah.net

  • Statistical concepts
  • Sample Representation
  • Large-sample distribution
  • Small-sample distribution
  • Population analysis and inference
  • Hypothesis testing
  • Factor analysis and principal component analysis
  • Missing values
  • Time Series analysis

Statistical Concepts

Mean

For a finite population of $n$ values, $$ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i $$ For a sample $\{x_i\}$, $$ \overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

Variance

For a finite population, $$ \sigma^2 = \frac {1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 $$

For a sample $\{x_i\}$, the sample variance with the same $1/n$ factor is $$ s_n^2 = \frac {1}{n} \sum_{i=1}^{n} (x_i - \overline{x})^2 $$

To estimate the population variance from a sample, the corrected (unbiased) sample variance is used instead: $$ s^2 = \frac {1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2 $$
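A minimal sketch of these estimators with NumPy (the data values below are made up for illustration); `ddof` selects the $1/n$ or $1/(n-1)$ denominator:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()                     # sample mean, x-bar = 5.0
var_biased = x.var(ddof=0)          # 1/n denominator    -> 4.0
var_unbiased = x.var(ddof=1)        # 1/(n-1) denominator -> ~4.571

print(mean, var_biased, var_unbiased)
```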

Covariance

For samples $\{x_i\}$, $\{y_i\}$, $$ Cov(x,y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y}) $$ For random variables $X$, $Y$, $$ Cov(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY]-E[X]E[Y] $$
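A short NumPy sketch (illustrative values); note that `np.cov` divides by $n-1$ by default, so `bias=True` is passed here to match the $1/n$ formula above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_manual = np.mean((x - x.mean()) * (y - y.mean()))   # 1/n formula -> 1.2
cov_numpy = np.cov(x, y, bias=True)[0, 1]               # same value

print(cov_manual, cov_numpy)
```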

Expectation and Variance

The expectation (written here for $n$ equally likely values $X_i$): $$ E(X) = \frac{1}{n} \sum_{i=1}^{n} X_i $$

The variance: $$ D(X) = E((X-E(X))^2) = E(X^2 - 2XE(X) + E(X)^2) = E(X^2)-E(X)^2 $$

Basic Requirements for statistics

Unbiasedness

A statistic used as an estimator should be unbiased: its expectation should equal the population quantity it estimates. For example, the sample mean is an unbiased estimator of the population mean: $$E(\overline{x}) = \mu$$
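A small simulation sketch of this idea (NumPy assumed): averaged over many samples, the sample mean recovers $\mu$, while the $1/n$ variance systematically underestimates $\sigma^2$, which is why the $1/(n-1)$ correction is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 10.0, 2.0, 5, 100_000

samples = rng.normal(mu, sigma, size=(trials, n))

print(samples.mean(axis=1).mean())           # ~10.0 : E(x-bar) = mu
print(samples.var(axis=1, ddof=0).mean())    # ~3.2  : biased, (n-1)/n * sigma^2
print(samples.var(axis=1, ddof=1).mean())    # ~4.0  : unbiased estimate of sigma^2
```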

Distribution of Large Samples

Law of Large Numbers

The Law of Large Numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

The LLN is important because it "guarantees" stable long-term results for the averages of some random events. It's important to remember that the LLN only applies (as the name indicates) when a large number of observations are considered. There is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be "balanced" by the others.
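A minimal die-rolling sketch (NumPy assumed): the running average of fair-die rolls drifts toward the expected value $3.5$ as the number of trials grows:

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)                 # fair six-sided die
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(n, running_mean[n - 1])                        # approaches 3.5
```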

Central Limit Theorem

In probability theory, the Central Limit Theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined (finite) expected value and finite variance, will be approximately normally distributed, regardless of the underlying distribution.

To illustrate what this means, suppose that a sample is obtained containing a large number of observations, each observation being randomly generated in a way that does not depend on the values of the other observations, and that the arithmetic average of the observed values is computed. If this procedure is performed many times, the central limit theorem says that the computed values of the average will be distributed according to the normal distribution.
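A sketch of exactly this procedure (NumPy assumed), using a clearly non-normal exponential population:

```python
import numpy as np

rng = np.random.default_rng(2)
n, repeats = 50, 20_000

# Each row is one sample of size n from Exp(1); take each sample's mean.
means = rng.exponential(scale=1.0, size=(repeats, n)).mean(axis=1)

# For Exp(1), mu = 1 and sigma^2 = 1, so the CLT predicts the means are
# approximately N(1, 1/50), i.e. standard deviation ~0.141.
print(means.mean(), means.std())
```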

The central limit theorem has a number of variants. In its common form, the random variables must be identically distributed. In variants, convergence of the mean to the normal distribution also occurs for non-identical distributions or for non-independent observations, given that they comply with certain conditions.

In more general usage, a central limit theorem is any of a set of weak-convergence theorems in probability theory. They all express the fact that a sum of many independent and identically distributed (IID) random variables, or alternatively, random variables with specific types of dependence, will tend to be distributed according to one of a small set of attractor distributions. When the variance of the independent and identically distributed variables is finite, the attractor distribution is the normal distribution. In contrast, the sum of a number of IID random variables with power law tail distributions decreasing as $|x|^{-\alpha-1}$ where $0<\alpha<2$ (and therefore having infinite variance) will tend to an alpha-stable distribution with stability parameter (or index of stability) of $\alpha$ as the number of variables grows.
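As a contrast sketch (NumPy assumed): for standard Cauchy data, which has power-law tails with $\alpha = 1$ and hence infinite variance, the sample mean never settles down; its spread does not shrink as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)
for n in (10, 1_000, 10_000):
    means = rng.standard_cauchy(size=(1_000, n)).mean(axis=1)
    # The 5th and 95th percentiles of the sample means stay roughly constant,
    # unlike the 1/sqrt(n) shrinkage seen for finite-variance data.
    print(n, np.percentile(means, [5, 95]))
```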

Classical CLT

Let ${X_1,...,X_n}$ be a random sample of size $n$ -- that is, a sequence of independent and identically distributed random variables drawn from distributions of expected values given by $\mu$ and finite variances given by $\sigma^2$. Suppose we are interested in the sample average $$S_n = \frac {X_1+...+X_n} {n}$$ of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value $\mu$ as $n \to \infty$. The classical central limit theorem describes the size and the distributional form of the stochastic fluctuations around the deterministic number $\mu$ during this convergence. More precisely, it states that as $n$ gets larger, the distribution of the difference between the sample average $S_n$ and its limit $\mu$, when multiplied by the factor $\sqrt{n}$ (that is, $\sqrt{n}(S_n-\mu)$), approximates the normal distribution with mean 0 and variance $\sigma^2$. For large enough $n$, the distribution of $S_n$ is close to the normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$. The usefulness of the theorem is that the distribution of $\sqrt{n}(S_n-\mu)$ approaches normality regardless of the shape of the distribution of the individual $X_i$'s. Formally, the theorem can be stated as follows:

Lindeberg-Lévy CLT. Suppose $\{X_1,X_2,...\}$ is a sequence of independent and identically distributed random variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2 < \infty$. Then as $n$ approaches infinity, the random variables $\sqrt{n}(S_n-\mu)$ converge in distribution to a normal $N(0,\sigma^2)$:
$$\sqrt{n} \left(\left(\frac{1}{n}\sum_{i=1}^{n}X_i\right)-\mu\right) \xrightarrow{d} N(0,\sigma^2)$$

In the case $\sigma > 0$, convergence in distribution means that the cumulative distribution functions of $\sqrt{n}(S_n-\mu)$ converge pointwise to the cdf of the $N(0,\sigma^2)$ distribution: for every real number $z$, $$\lim_{n \to \infty} Pr[\sqrt{n}(S_n-\mu)\le z] = \Phi(z/\sigma)$$

where $\Phi(x)$ is the standard normal cdf evaluated at $x$. Note that the convergence is uniform in $z$ in the sense that $$\lim_{n \to \infty} \sup_{z\in \textbf{R}} |Pr[\sqrt{n}(S_n-\mu)\le z]-\Phi(z/\sigma)|=0$$

where sup denotes the least upper bound (or supremum) of the set.
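A small numerical check of this statement (NumPy and SciPy assumed), comparing the empirical distribution of $\sqrt{n}(S_n-\mu)$ against $N(0,\sigma^2)$ with a Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, repeats = 200, 5_000

# Uniform(0, 1) population: mu = 1/2, sigma^2 = 1/12.
samples = rng.uniform(size=(repeats, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 0.5)

# A large p-value indicates no detectable deviation from N(0, 1/12).
print(stats.kstest(z, "norm", args=(0.0, np.sqrt(1 / 12))))
```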

Regression Analysis

Residual

The definition: $$ \hat{\epsilon} = y - \hat{y} $$

Mean Square Error (MSE)

The definition: $$ MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 $$
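A minimal least-squares sketch on synthetic data (NumPy assumed), computing the residuals and the MSE defined above:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)   # noisy line

a, b = np.polyfit(x, y, deg=1)     # fitted slope and intercept
y_hat = a * x + b                  # fitted values

residuals = y - y_hat              # epsilon-hat = y - y-hat
mse = np.mean((y_hat - y) ** 2)

print(a, b, mse)
```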

Normal distribution test

$\chi^2$ Test

Kolmogorov-Smirnov Test

Lilliefors Test
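These three sections name standard normality tests; the following is a hedged sketch of how each might be run, assuming SciPy and statsmodels are installed (the chi-square version here uses equal-probability bins, which is one common variant):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(6)
data = rng.normal(loc=5.0, scale=2.0, size=500)

# Kolmogorov-Smirnov test against a fully specified normal distribution.
print(stats.kstest(data, "norm", args=(data.mean(), data.std())))

# Lilliefors test: KS variant with mean and variance estimated from the data.
print(lilliefors(data, dist="norm"))

# Chi-square goodness-of-fit over 10 equal-probability bins; ddof=2 accounts
# for the two estimated parameters.
edges = stats.norm.ppf(np.linspace(0.1, 0.9, 9), loc=data.mean(), scale=data.std())
observed = np.bincount(np.digitize(data, edges), minlength=10)
expected = np.full(10, data.size / 10)
print(stats.chisquare(observed, expected, ddof=2))
```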

Data Process

Data Cleaning and Missing Data

In other words, the pattern of missing data can itself be treated as a feature of the data set.

Add more features

